Motivation
Structure of a search engine
Ingredients
Demo
Future improvements
Max Maischein
Frankfurt.pm
Perl since 2000
Financial regulatory regimes since 2013 (EMIR, EinSiG, MiFID II, ...)
Smart Data + data mining since 2016
Too much different information
Too little time to organize the information
Different from Google
Keep data local
Don't become part of a resultset
Google Search Appliance (too expensive)
Windows Desktop Search/Cortana (Only Windows shares, no mail etc.)
Siri+Sherlock (Mac) (No Mac)
Beagle for Linux or Ubuntu (Stopped in 2009)
Otherwise there would be no talk for me
Little time
Reuse many available building blocks
Find documents
Find linked documents
Extract text
Import text
Metadata: Text language / URL / Creation time stamp
Optimized data structure
Quick retrieval
Stemming (Find "Programs" and "Programming" when searching for "Program")
Synonyms
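A minimal sketch of how stemming and an index structure fit together, assuming nothing about how Elasticsearch does it internally. The `naive_stem` helper and the two sample documents are made up for illustration; real analyzers handle far more cases than this suffix stripping.

```perl
use strict;
use warnings;

# Naive suffix stripping, for illustration only (hypothetical helper).
sub naive_stem {
    my ($word) = @_;
    $word = lc $word;
    # Strip one common suffix, keeping at least three letters of stem
    $word =~ s/^(.{3,}?)(?:ing|s)$/$1/;
    # Collapse a doubled final consonant left behind ("programm" -> "program")
    $word =~ s/([b-df-hj-np-tv-z])\1$/$1/;
    return $word;
}

# Inverted index: stem -> list of document ids containing that stem
my %docs = (1 => 'Programs in Perl', 2 => 'Programming a crawler');
my %index;
for my $id (sort keys %docs) {
    push @{ $index{ naive_stem($_) } }, $id for split /\W+/, $docs{$id};
}

# Searching for "Program" finds both documents
print join(',', @{ $index{ naive_stem('Program') } }), "\n";
```

Because query and documents go through the same stemmer, lookup is a single hash access, which is what makes retrieval quick.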
Query entry
Quick (!) response
Ranking
Preview of document
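The query side can be sketched as a plain data structure that is sent to Elasticsearch as JSON. This is an assumption about how such a request could look, not the request Dancer::SearchApp actually sends; the field names `title` and `body`, the title boost and the `build_query` helper are chosen for illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# Build an Elasticsearch query body: full-text match on title and body,
# plus highlighted snippets that can serve as a result preview.
sub build_query {
    my ($terms, $size) = @_;
    return {
        size  => $size // 10,
        query => {
            multi_match => {
                query  => $terms,
                fields => [ 'title^2', 'body' ],  # title boost is an assumption
            },
        },
        highlight => { fields => { body => {} } },
    };
}

print JSON::PP->new->canonical->pretty->encode( build_query('perl crawler') );
```

Ranking comes from the relevance score Elasticsearch computes for each hit; the `highlight` section asks for matching fragments of `body` to show as a preview.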
Crawler / Extractor (Perl+Apache Tika)
Index (Elasticsearch, Search::Elasticsearch)
Search (Dancer)
cpanm --look Dancer::SearchApp
plackup -Ilib -p 5001 --host 127.0.0.1 -a bin/app.pl &

perl -Ilib -w bin/index-filesystem.pl t\documents

# Search
URL / id ( file:// or mail:// )
title
body (HTML)
author
type (file or mail)
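As a sketch, one indexed document with the fields above could be assembled like this. The `make_document` helper, its validation rules and the sample values are hypothetical; the real application may store more fields or name them differently.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: assemble one index document with the fields above.
sub make_document {
    my (%f) = @_;
    die "type must be 'file' or 'mail'\n"
        unless ($f{type} // '') =~ /^(?:file|mail)$/;
    die "url must start with file:// or mail://\n"
        unless ($f{url} // '') =~ m{^(?:file|mail)://};
    # Keep only the known fields, in a fixed set
    return { map { $_ => $f{$_} } qw(url title body author type) };
}

my $doc = make_document(
    url    => 'file:///home/user/talks/search.pdf',   # sample path, made up
    title  => 'Personal search engine',
    body   => '<p>Extracted text as HTML</p>',
    author => 'Max Maischein',
    type   => 'file',
);
print JSON::PP->new->canonical->pretty->encode($doc);
```

Using the URL as the document id means re-indexing the same file or mail overwrites the old entry instead of creating a duplicate.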
File system (pdf, Text, Audio, via Apache::Tika::Async)
IMAP
ICal
HTTP (also, Plack)
Pagerank vs. Elasticsearch rank
Pagerank recognizes "Hub" pages
Every document on MY laptop is "interesting"
We can display local content in local formats
PDF (as HTML)
Mail (link to Thunderbird)
Music (direct link)
Extraction from Online-Content (Intranet, HTML::ContentExtractor::FTR)
More extractors (video, ...)
Metasearch across Elasticsearch instances (Laptop in home network)
Apache Tika
https://tika.apache.org/download.html
http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.13.jar
Elasticsearch
https://www.elastic.co/downloads/elasticsearch
https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.2.0/elasticsearch-2.2.0.zip
Questions?
Dancer::SearchApp
corion@cpan.org
Hitman Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/
Google Search Appliance image by Google Inc.
Cortana image by Microsoft Inc.
Apple Siri logo by Apple Inc.
Beagle logo by Fornax / Beagle Project
https://de.wikipedia.org/wiki/Datei:Beagle_Logo.svg
My own mails
Trip back to 2000
No good for public consumption
EU/ESMA produces many PDFs
I produce many Perl programs
YAPC / Act produces many calendars
The heart of the search engine
Content extraction
Much existing code
File::Find
Apache::Tika::Async for text extraction
Special extraction for mp3 and images
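The file-finding half of the crawler can be sketched with File::Find alone. This is an illustration under assumptions, not the code of index-filesystem.pl: the `find_documents` helper and its record layout are made up, and text extraction (Apache::Tika::Async in the talk) is left out entirely.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Walk a directory tree and return one record per file with a wanted extension.
sub find_documents {
    my ($root, @extensions) = @_;
    my $wanted_ext = join '|', map { quotemeta } @extensions;
    my @found;
    find({
        no_chdir => 1,    # $_ holds the full path of each entry
        wanted   => sub {
            return unless -f $_ && /\.(?:$wanted_ext)$/i;
            push @found, {
                # Simplistic URL; Windows paths would need extra conversion
                url   => 'file://' . File::Spec->rel2abs($_),
                mtime => (stat $_)[9],    # last modification time
                type  => 'file',
            };
        },
    }, $root);
    @found = sort { $a->{url} cmp $b->{url} } @found;
    return @found;
}

# Usage: list PDF and text files below the current directory
print "$_->{url}\n" for find_documents('.', 'pdf', 'txt');
```

Each record would then be handed to the extractor, which adds title and body before the document goes into the index.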
Done
Not good for presentation
Trip back to 2001
Start with file import
index-imap.pl